library(tidyverse)
library(igraph)
library(tidygraph)
library(ggraph)Analyzing Retweet Dynamics of German MPs in the 19th Legislative Period
Assignment 2: Data Analysis Exercise
In a data analysis exercise a real-world, but pre-defined and cleaned data set, is assigned from a small collection of examples. A list of descriptive measures, as well as community detection methods should be applied, using a programming language and libraries of choice (which have been introduced in the course before). In a short report, either including the code (e.g., a python notebook) or a text document with plots (in this case the code should be submitted additionally), the results should be presented and their meaning interpreted and discussed with respect to the dataset at hand. Students will have two weeks to solve this task individually at home. Grading will be one half the correct application of the descriptive measures and the other half the presentation (i.e., plots), the interpretation and the discussion of those results.
Load and Manipulate Dataset
In my Master’s Thesis, I scale German MPs (MdBs) of the 19th legislative period across ideological dimensions that are determined in an unsupervised way by using parliamentary speeches and Tweets. While retweet dynamics do not play a big role in the scaling algorithm per se, party affiliation does and is particularly interesting for network analytical approaches, especially in the realm of clustering.
tweet_df_final <- readRDS("data/tweet_df_final_2.rds")
# further meta information
od_tw_info <- read_csv("data/od_tw_merge_final.csv")How many Tweets are there in total? Over 1.3 Million - quite a lot.
tweet_df_final %>%
nrow()[1] 1301427
How many Twitter users are there in total across all MPs of the 19th legislative period?
tweet_df_final %>%
select(author_id, tw_id) %>%
distinct() %>%
nrow()[1] 515
Create data.frame that exhibits information of referenced Tweets
retweet_df <- tweet_df_final %>%
select(referenced_tweets, tweet_type_character, tweet_type_final, id) %>%
mutate(
# type of referenced tweet
ref_tweet_types = map(1:length(referenced_tweets),
~ referenced_tweets[[.x]][["type"]]),
# id of referenced tweet (NOT id of user)
ref_tweet_id = map(1:length(referenced_tweets),
~ referenced_tweets[[.x]][["id"]])
) Create final data.frame that only consists of retweets that occured within the dataset
Filter for retweets and merge author information from retweeted Tweets.
# vector with IDs from Tweets that are from authors within the dataset
tweet_ids <- tweet_df_final$id
retweet_within_df <- retweet_df %>%
filter(ref_tweet_types == "retweeted") %>%
filter(ref_tweet_id %in% tweet_ids) %>%
left_join(tweet_df_final %>% select(id, author_id, tw_id)) %>%
# directedness: from (orginal tweet) -> to (retweeted tweet)
left_join(author_id_ref_tweet, by = "id", suffix = c("_to", "_from"))How many Tweets are left after constructing the “within-retweet data.frame”? Just about 100000 or 7.7% of all Tweets.
retweet_within_df %>% nrow()[1] 100311
paste0((nrow(retweet_within_df) / nrow(tweet_df_final) * 100) %>% round(1), "%")[1] "7.7%"
Validate Retweet-Network Structure
retweet_within_df %>% select(tw_id_from, tw_id_to) %>% head(5) tw_id_from tw_id_to
1 katjakipping dorisachelwilm
2 victorperli dorisachelwilm
3 niemamovassat dorisachelwilm
4 michel_brandt_ dorisachelwilm
5 dietmarbartsch dorisachelwilm
This makes actually sense! Doris Achelwilm (MP from Left Party) retweets Katja Kipping and Dietmar Bartsch, which do both play a central role in the Left Party: They were party leaders in the 19th legislative period.
Plot most retweeted MPs
First, let us create a vector with party colors.
party_colors <- c("#138BD8", "#000000", "#FFEE0A", "#529222", "#AE1862", "#E30019")
names(party_colors) <- c("AfD", "CDU/CSU", "FDP", "Grüne", "DIE LINKE.", "SPD")We also create a ggplot2 theme, which is based on theme_bw(), but has some refinements.
theme_custom <- function() {
font <- "Corbel"
# base theme
theme_bw() %+replace%
theme(
text = element_text(family = font),
legend.text = element_text(size = 9),
legend.title = element_blank(),
legend.box.background = element_rect(
colour = "black",
fill = "white",
linetype = "solid"
),
# grid lines
panel.grid.major = element_line(color = "grey60", size = 0.2),
panel.grid.minor = element_line(color = "grey80", size = 0.1),
# faceting
strip.background = element_blank(),
strip.text = element_text(color = "black"),
)
}Now, we set the new theme as default.
theme_set(theme_custom())We also create a data.frame that just includes the party affiliation per MP.
party_user <- tweet_df_final %>%
distinct(tw_id, party)Plot Top 20
Now, we count how often per MP a Tweet was retweeted by another MP. We plot the to 20 MPs in total and visualize the party affiliation.
retweet_within_df %>%
group_by(tw_id_from) %>%
count() %>%
ungroup() %>%
arrange(desc(n)) %>%
top_n(20) %>%
left_join(party_user, by = c("tw_id_from" = "tw_id")) %>%
ggplot(aes(x = n, y = reorder(tw_id_from, n), fill = party)) +
geom_col() +
scale_fill_manual(values = party_colors) +
labs(y = "", title = "Most Prominent MPs by Retweets of other MPs") +
theme_custom()Interestingly, the AfD and FDP dominate the chart, while there is no single SPD MP within the top 20 most retweeted MPs within. Let’s check the pattern and plot by party! Overall, the results seem to make sense, given that party leaders and influential MPs are the most retweeted MPs according to the bar plot.
Plot Top 20 by Party
Now, let’s turn the attention to the most retweeted MPs by party. Both CDU/CSU and SPD, the biggest parties and also only members of the governing coalition of the 19th legislative period do not have as many influential MPs, measured by number of retweets across their colleagues, as opposition parties have.
retweet_within_df %>%
left_join(party_user, by = c("tw_id_from" = "tw_id")) %>%
filter(!party %in% c("Fraktionslos", NA)) %>%
group_by(tw_id_from, party) %>%
count() %>%
group_by(party) %>%
arrange(desc(n)) %>%
top_n(10) %>%
ggplot(aes(x = n, y = reorder(tw_id_from, n), fill = party)) +
geom_col() +
facet_wrap(~ party, scales = "free_y", ncol = 2) +
scale_fill_manual(values = party_colors) +
labs(y = "", title = "Most Prominent MPs by Retweets of other MPs", subtitle = "Grouped by Party") +
theme_custom()Construct Retweet-Network
In a Retweet-Network each retweet is represented as a directed link in a network that connects from the original Tweet to the referenced retweet. Users that are often retweeted within the network are expected to be associated with some degree of (digital) political relevance, even though there a lot of potential confounding variables (e.g. number of Tweets, number of Followers, general social media behaviour and association with socio-demographic factors such as cohort affiliation).
retweet_network <- retweet_within_df %>%
select(from = tw_id_from, to = tw_id_to) %>%
as_tbl_graph(directed = TRUE) %>%
left_join(party_user, by = c("name" = "tw_id"))
# get corresponding row ID (from ID)
row_id_from <- retweet_network %>%
as_tibble() %>%
mutate(id_from = row_number())
# update retweet network with edge information: name and party of retweeted users (from)
retweet_network <- retweet_network %>%
activate(edges) %>%
left_join(row_id_from, by = c("from" = "id_from")) %>%
rename(party_from = party, name_from = name) %>%
activate(nodes)
retweet_network# A tbl_graph: 487 nodes and 100311 edges
#
# A directed multigraph with 1 component
#
# Node Data: 487 × 2 (active)
name party
<chr> <chr>
1 katjakipping DIE LINKE.
2 victorperli DIE LINKE.
3 niemamovassat DIE LINKE.
4 michel_brandt_ DIE LINKE.
5 dietmarbartsch DIE LINKE.
6 susanneferschl DIE LINKE.
# … with 481 more rows
#
# Edge Data: 100,311 × 4
from to name_from party_from
<int> <int> <chr> <chr>
1 1 44 katjakipping DIE LINKE.
2 2 44 victorperli DIE LINKE.
3 3 44 niemamovassat DIE LINKE.
# … with 100,308 more rows
Degree Centrality
Out-degree
In the previous plots, we already calculated the out-degree centrality (which is the same as the number of outgoing edges per node or the number of original Tweets that have been retweeted within). Nevertheless, double-checking is always a good practice!
# %N>% activate nodes (tidygraph syntax)
retweet_network %N>%
mutate(
# in_degree = centrality_degree(mode = "in"),
out_degree = centrality_degree(mode = "out")
) %>%
arrange(desc(out_degree)) %>%
select(name, out_degree, party) %>%
as_tibble() %>%
head(20) %>%
knitr::kable()| name | out_degree | party |
|---|---|---|
| c_lindner | 4066 | FDP |
| udohemmelgarn | 3983 | AfD |
| alice_weidel | 3853 | AfD |
| frank_pasemann | 3794 | AfD |
| brihasselmann | 2475 | Grüne |
| marcobuschmann | 2294 | FDP |
| schneider_afd | 1830 | AfD |
| beatrix_vstorch | 1554 | AfD |
| gtzfrmming | 1330 | AfD |
| dietmarbartsch | 1208 | DIE LINKE. |
| m_reichardt_afd | 1164 | AfD |
| konstantinkuhle | 1162 | FDP |
| olliluksic | 1154 | FDP |
| paulziemiak | 1022 | CDU/CSU |
| joanacotar | 1021 | AfD |
| konstantinnotz | 956 | Grüne |
| renner_afd | 933 | AfD |
| nicole_hoechst | 921 | AfD |
| jensspahn | 901 | CDU/CSU |
| johannesvogel | 888 | FDP |
Now, let’s plot the distribution of the out-degree. There are a few outliers at the upper end.
retweet_network %N>%
mutate(
# in_degree = centrality_degree(mode = "in"),
out_degree = centrality_degree(mode = "out")
) %>%
as_tibble() %>%
ggplot(aes(x = out_degree)) +
geom_histogram() +
labs(x = "Out-Degree") +
theme_custom()The out-degree distribution is quite comparable across parties: Most MPs were never retweeted, but have important retweet hubs. However, for the Green and Left Party, the distribution seems to be a bit less skewed.
retweet_network %N>%
filter(!party %in% c("Fraktionslos", NA)) %>%
mutate(
# in_degree = centrality_degree(mode = "in"),
out_degree = centrality_degree(mode = "out")
) %>%
as_tibble() %>%
ggplot(aes(x = out_degree, fill = party)) +
geom_histogram() +
labs(x = "Out-Degree") +
facet_wrap(~ party, scales = "free") +
scale_fill_manual(values = party_colors) +
theme_custom()In-degree
The in-degree corresponds to the “inverse retweet behaviour”: It indicates which MPs retweeted Tweets of other MPs.
# %N>% activate nodes (tidygraph syntax)
retweet_network %N>%
mutate(
in_degree = centrality_degree(mode = "in"),
# out_degree = centrality_degree(mode = "out")
) %>%
arrange(desc(in_degree)) %>%
select(name, in_degree, party) %>%
as_tibble() %>%
head(20) %>%
knitr::kable()| name | in_degree | party |
|---|---|---|
| udohemmelgarn | 6002 | AfD |
| schneider_afd | 5001 | AfD |
| frank_pasemann | 3583 | AfD |
| martinrosemann | 3409 | SPD |
| renner_afd | 3270 | AfD |
| olliluksic | 2140 | FDP |
| nicole_hoechst | 1735 | AfD |
| drandreasnick | 1567 | CDU/CSU |
| stbrandner | 1519 | AfD |
| rkiesewetter | 1505 | CDU/CSU |
| reinholdmdb | 1494 | FDP |
| c_jung77 | 1284 | FDP |
| k_sa | 1225 | Grüne |
| elsnervongronow | 1187 | AfD |
| tschipanski | 1006 | CDU/CSU |
| ullinissen | 991 | SPD |
| andi_wagner | 969 | DIE LINKE. |
| gtzfrmming | 868 | AfD |
| djanecek | 834 | Grüne |
| steinekecdu | 807 | CDU/CSU |
retweet_network %N>%
mutate(
in_degree = centrality_degree(mode = "in"),
# out_degree = centrality_degree(mode = "out")
) %>%
as_tibble() %>%
ggplot(aes(x = in_degree)) +
geom_histogram() +
labs(x = "In-Degree") +
theme_custom()retweet_network %N>%
filter(!party %in% c("Fraktionslos", NA)) %>%
mutate(
in_degree = centrality_degree(mode = "in"),
# out_degree = centrality_degree(mode = "out")
) %>%
as_tibble() %>%
ggplot(aes(x = in_degree, fill = party)) +
geom_histogram() +
labs(x = "In-Degree") +
facet_wrap(~ party, scales = "free") +
scale_fill_manual(values = party_colors) +
theme_custom()The in-degree distribution seems to be a bit more normally distributed, meaning that “inverse hubs”, or MPs that heavily retweeted other MPs, are also present but not as dominant as retweeted MP hubs. However, the AfD seems to dominate the retweet dynamic from this perspective, as the table above shows.
Comparing Out- and In-degree
In- and out-degree seem to be correlated (\(R = 0.53\)), however, there are outliers that influence the relationship heavily.
# %N>% activate nodes (tidygraph syntax)
retweet_network %N>%
mutate(
in_degree = centrality_degree(mode = "in"),
out_degree = centrality_degree(mode = "out")
) %>%
select(name, in_degree, out_degree) %>%
as_tibble() %>%
left_join(party_user, by = c("name" = "tw_id")) %>%
ggplot(aes(out_degree, in_degree)) +
geom_point(alpha = 0.5) +
ggpubr::stat_cor() +
theme_custom()It seems that there are a few outliers at the upper end of the distribution. Let’s see whether the correlation holds when they are removed.
The correlation is still quite high (\(R = 0.38\)) and highly significant, when we remove the MPs with the top 20 highest in- and out-degrees.
top_20_in <- retweet_network %>%
mutate(in_degree = centrality_degree(mode = "in")) %>%
arrange(desc(in_degree)) %>%
as_tibble() %>%
top_n(20) %>%
pull(name)
top_20_out <- retweet_network %>%
mutate(out_degree = centrality_degree(mode = "out")) %>%
arrange(desc(out_degree)) %>%
as_tibble() %>%
top_n(20) %>%
pull(name)
# %N>% activate nodes (tidygraph syntax)
retweet_network %N>%
mutate(
in_degree = centrality_degree(mode = "in"),
out_degree = centrality_degree(mode = "out")
) %>%
select(name, in_degree, out_degree) %>%
as_tibble() %>%
left_join(party_user, by = c("name" = "tw_id")) %>%
filter(!name %in% top_20_in, !name %in% top_20_out) %>%
ggplot(aes(out_degree, in_degree)) +
geom_point(alpha = 0.5) +
ggpubr::stat_cor() +
theme_custom()retweet_network %N>%
filter(!party %in% c("Fraktionslos", NA)) %>%
mutate(
in_degree = centrality_degree(mode = "in"),
out_degree = centrality_degree(mode = "out")
) %>%
select(name, in_degree, out_degree) %>%
as_tibble() %>%
left_join(party_user, by = c("name" = "tw_id")) %>%
filter(!name %in% top_20_in, !name %in% top_20_out) %>%
ggplot(aes(out_degree, in_degree)) +
geom_point(alpha = 0.5, aes(color = party)) +
ggpubr::stat_cor(show.legend = F) +
facet_wrap(~ party) +
scale_color_manual(values = party_colors) +
theme_custom()Distance / Shortest Path Length
Most of the users are connected across (a path length of) 2 or 3 retweets.
distances_retweet_network <- retweet_network %>%
distances() %>%
table() %>%
as_tibble() %>%
rename(Distance = 1)
distances_retweet_network %>%
knitr::kable()| Distance | n |
|---|---|
| 0 | 487 |
| 1 | 18554 |
| 2 | 105456 |
| 3 | 101042 |
| 4 | 11290 |
| 5 | 332 |
| 6 | 8 |
distances_retweet_network %>%
ggplot(aes(x = Distance, y = n)) + geom_col() +
ggtitle("Distribution of Distance (Shortest Path)") +
theme_custom()Network Visualization
Create a new custom theme, based on ggraph::theme_graph().
theme_graph_custom <- function() {
font <- "Corbel"
# base theme
theme_graph() %+replace%
theme(
text = element_text(family = font),
legend.text = element_text(size = 9),
legend.title = element_blank(),
legend.box.background = element_rect(
colour = "black",
fill = "white",
linetype = "solid"
),
# faceting
strip.background = element_blank(),
strip.text = element_text(color = "black"),
)
}Standard Plot
Using ggraph auto/default options to plot all nodes and edges, we can see that there is a strong pattern of “intra-party retweeting”. Interestingly, while most parties members exhibits almost exclusively retweet within the party, there is a strong visual intersection of SPD and Left Party retweets. To a lesser extent, this also repeats for party of the Greens with the SPD
plot_all_auto <- retweet_network %>%
ggraph(layout = "auto") +
geom_edge_link(alpha = 0.015) +
geom_node_point(aes(color = party)) +
scale_color_manual(values = party_colors) +
theme_graph_custom()
plot_all_autoCircular
Now let’s visualize the network in a circular way, highlighting the retweet dynamics in a way that underlines the party structure of the dataset.
plot_all_circular <- retweet_network %>%
filter(!party %in% c("Fraktionslos", NA)) %>%
ggraph(layout = "linear", circular = TRUE) +
geom_edge_arc(aes(color = party_from), alpha = 0.015) +
scale_edge_color_manual(values = party_colors) +
coord_fixed() +
theme_graph_custom() +
theme(legend.position = "none")
plot_all_circularFurther Centrality Measures
Closeness Centrality
The Closeness Centrality is defined as the inverse total distance of paths between a node and all other nodes in the network. Shorter paths to others are thus an indicator of node centrality.
closeness_retweet_network <- retweet_network %>%
mutate(Closeness = centrality_closeness())The Closeness Centrality is not dominated by the AfD, which was exhibiting the MPs with the highest degrees. This finding underlines that centrality measures that go beyond the degree centrality play an important role when assessing the actual dynamics of the dataset.
closeness_retweet_network %>%
filter(Closeness != 1) %>%
arrange(desc(Closeness)) %>%
select(name, Closeness, party) %>%
as_tibble() %>%
head(20) %>%
knitr::kable()| name | Closeness | party |
|---|---|---|
| konstantinkuhle | 0.0011416 | FDP |
| karl_lauterbach | 0.0010776 | SPD |
| brihasselmann | 0.0010672 | Grüne |
| marcobuschmann | 0.0010616 | FDP |
| heikomaas | 0.0010604 | SPD |
| larsklingbeil | 0.0010504 | SPD |
| johannesvogel | 0.0010194 | FDP |
| rkiesewetter | 0.0010194 | CDU/CSU |
| renatekuenast | 0.0010183 | Grüne |
| thomasoppermann | 0.0010183 | SPD |
| mgrossebroemer | 0.0010163 | CDU/CSU |
| c_lindner | 0.0010152 | FDP |
| konstantinnotz | 0.0010152 | Grüne |
| schneidercar | 0.0010121 | SPD |
| katarinabarley | 0.0010121 | SPD |
| nouripour | 0.0010101 | Grüne |
| mastrackzi | 0.0010091 | FDP |
| sigmargabriel | 0.0010091 | SPD |
| lambsdorff | 0.0010070 | FDP |
| cem_oezdemir | 0.0010040 | Grüne |
The closeness centrality seems to be actually normally distributed, while there are some outliers on the lower bound.
closeness_retweet_network %>%
filter(Closeness != 1) %>%
as_tibble() %>%
ggplot(aes(x = Closeness)) +
geom_density() +
theme_custom()While most parties have an approximately equally distributed Closeness Centrality, the AfD has the lowest one with a visually perceivable difference. For interpretational reasons, this is likelly due to internal heterogeneity, given the populist character and short existence period of the party.
closeness_retweet_network %>%
filter(!party %in% c("Fraktionslos", NA)) %>%
filter(Closeness != 1) %>%
as_tibble() %>%
ggplot(aes(x = Closeness, y = reorder(party, Closeness), fill = party)) +
ggridges::stat_density_ridges(alpha = 0.5, quantile_lines = T, quantiles = 4) +
scale_fill_manual(values = party_colors) +
theme_custom() +
labs(y = "")Betweenness Centrality
TODO: explain betweenness centrality
betweenness_retweet_network <- retweet_network %>%
mutate(Betweenness = centrality_betweenness())betweenness_retweet_network %>%
arrange(desc(Betweenness)) %>%
select(name, Betweenness, party) %>%
as_tibble() %>%
head(20) %>%
knitr::kable()| name | Betweenness | party |
|---|---|---|
| tschipanski | 26989.810 | CDU/CSU |
| s_muenzenmaier | 19464.496 | AfD |
| stbrandner | 18131.114 | AfD |
| olliluksic | 14463.042 | FDP |
| rkiesewetter | 12089.569 | CDU/CSU |
| _martinneumann | 8452.709 | FDP |
| fabiodemasi | 8202.163 | DIE LINKE. |
| joanacotar | 8010.589 | AfD |
| hubertus_heil | 7998.812 | SPD |
| c_lindner | 7090.924 | FDP |
| gtzfrmming | 6647.937 | AfD |
| timon_gremmels | 6566.856 | SPD |
| wanderwitz | 6504.270 | CDU/CSU |
| matthiashauer | 6361.569 | CDU/CSU |
| udohemmelgarn | 6303.254 | AfD |
| mgrossebroemer | 6059.420 | CDU/CSU |
| sven_kindler | 5791.922 | Grüne |
| larsklingbeil | 5217.861 | SPD |
| drandreasnick | 4975.422 | CDU/CSU |
| konstantinnotz | 4778.380 | Grüne |
betweenness_retweet_network %>%
as_tibble() %>%
ggplot(aes(x = Betweenness)) +
geom_histogram() +
theme_custom()betweenness_retweet_network %>%
as_tibble() %>%
ggplot(aes(x = Betweenness)) +
geom_histogram() +
scale_x_log10() +
theme_custom()betweenness_retweet_network %>%
filter(!party %in% c("Fraktionslos", NA)) %>%
filter(Betweenness != 1) %>%
as_tibble() %>%
ggplot(aes(x = Betweenness, color = party)) +
facet_wrap(~ party) +
geom_density() +
scale_x_log10() +
# geom_density() +<
scale_color_manual(values = party_colors) +
theme_custom() +
labs(y = "")Selecting Subgraph for Visualization
Selecting a subgraph (sub-network) based on the largest component.
Community Detection
Since we are dealing with a directed network, not every type of community detection algorithm is appropriate.
retweet_network_group_betweenness <- retweet_network %>%
mutate(group = group_edge_betweenness(n_groups = 6))